Peta-Scale Embedded Photonics Architecture for Distributed Deep Learning Applications

Abstract

As Deep Learning (DL) models grow larger and more complex, training jobs are increasingly distributed across multiple Computing Units (CU) such as GPUs and TPUs. Each CU processes a sub-part of the model and synchronizes results with the others. Communication among these CUs has emerged as a key bottleneck in the training process. In this work, we present SiPAC, a Silicon Photonic Accelerated Compute cluster. SiPAC accelerates distributed DL training by means of two co-designed components: a photonic physical layer and a novel collective algorithm. The physical layer exploits embedded photonics to bring peta-scale I/O directly to the CUs of a DL-optimized cluster and uses resonator-based optical wavelength selectivity to realize hardware multi-casting. The collective algorithm builds on the hardware multi-casting primitive. This combination expedites a variety of collective communications commonly employed in DL training and has the potential to drastically ease the communication bottlenecks. We demonstrate the feasibility of realizing the SiPAC architecture through 1) an optical testbed experiment where an array of comb laser wavelengths is shuffled by a cascaded ring switch, with each ring selecting and forwarding multiple wavelengths to increase the effective communication bandwidth, hence demonstrating the hardware multicasting primitive, and 2) a four-GPU testbed running a realistic DL workload that achieves a 22% system-level performance improvement relative to a similarly-sized leaf-spine topology. Large scale simulations show that SiPAC achieves a 1.4× to 5.9× communication time reduction compared to state-of-the-art compute clusters for representative collective communications.

Similar resources

Architecture and Applications for a Distributed Embedded Firewall

The distributed firewall is an important new line of network defense. It provides fine-grained access control to augment the protections afforded by the traditional perimeter firewall. To be effective, though, a distributed firewall must satisfy two critical requirements. First, it must embrace a protection model that acknowledges that everything behind the firewall may not be trustworthy. The ...

How to scale distributed deep learning?

Training time on large datasets for deep neural networks is the principal workflow bottleneck in a number of important applications of deep learning, such as object classification and detection in automatic driver assistance systems (ADAS). To minimize training time, the training of a deep neural network must be scaled beyond a single machine to as many machines as possible by distributing the ...

The Willow Architecture: Comprehensive Survivability for Large-Scale Distributed Applications

The Willow architecture is a comprehensive approach to survivability in critical distributed applications. Survivability is achieved in a deployed system using a unique combination of (a) fault avoidance by disabling vulnerable network elements intentionally when a threat is detected or predicted, (b) fault elimination by replacing system software elements when faults are discovered, and (c) fa...

Peta-Scale Computing

In a few short years, computers capable of over one Petaflops performance will become a reality. The most likely approach for first successfully reaching this performance level will involve several thousands of parallel processing elements. What are the key considerations for building such systems? What are the software requirements and demands? How will applications scale? How reliable are the...

Performance-Optimum Superscalar Architecture for Embedded Applications

Embedded applications are widely used in portable devices such as wireless phones, personal digital assistants, laptops, etc. High throughput and real time requirements are especially important in such data-intensive tasks. Therefore, architectures that provide the required performance are the most desirable. On the other hand, processor performance is severely related to the average memory acc...

Journal

Journal title: Journal of Lightwave Technology

Year: 2023

ISSN: 0733-8724, 1558-2213

DOI: https://doi.org/10.1109/jlt.2023.3276588